Detect Language (Operator Toolbox)

Synopsis

This operator detects the language of a text in a table.

Description

This operator takes a nominal column containing text and detects the language of each example. In addition to the most likely language, it returns a confidence value that indicates how certain the algorithm is of that prediction. Note that while the scale is always [0, 1], the concrete value may drop if you select more languages.
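
One way a score on a fixed scale can drop as more languages are selected is if the scores are normalized across all candidates. The Python sketch below assumes exactly that; the source does not say how the operator computes its confidence, and the raw scores and the normalize helper are made up for illustration.

```python
def normalize(raw_scores):
    """Scale non-negative raw scores so they sum to 1 (assumed normalization)."""
    total = sum(raw_scores.values())
    return {lang: score / total for lang, score in raw_scores.items()}

# Made-up raw scores for a single text; not output from the operator.
raw = {"German": 6.0, "Dutch": 3.0, "English": 1.0}

# Same text, scored against two candidate languages and then against three.
two_candidates = normalize({lang: raw[lang] for lang in ("German", "English")})
three_candidates = normalize(raw)

# German stays the top prediction, but its share shrinks with more candidates.
print(two_candidates["German"], three_candidates["German"])
```

Under this assumption, confidences from runs with different language selections are not directly comparable.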

The operator can detect over 70 languages. The more languages are selected, the longer the operator runs, so it is often helpful to restrict the possible languages to reduce runtime. If you exclude a language, a text in that language may be assigned to a wrong language (rather than to unknown), but with a lower confidence. Examples would be a Dutch text assigned to German or a Portuguese text assigned to Spanish. You can later use a Filter Examples operator to remove uncertain predictions.
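
The filtering step described above can be sketched outside the process as follows. This is a minimal Python illustration, not the Filter Examples operator itself: the rows and their values are invented, the column names follow the output description of this operator, and the 0.8 cutoff is an assumed threshold that should be tuned on your own data.

```python
# Invented rows shaped like the operator's result table.
rows = [
    {"text": "Dit is een zin.", "prediction(language)": "German", "confidence(language)": 0.41},
    {"text": "This is a sentence.", "prediction(language)": "English", "confidence(language)": 0.97},
    {"text": "Das ist ein Satz.", "prediction(language)": "German", "confidence(language)": 0.93},
]

MIN_CONFIDENCE = 0.8  # assumed cutoff, not a value recommended by the source

def keep_confident(rows, threshold=MIN_CONFIDENCE):
    """Keep rows whose confidence is present and at least the threshold."""
    return [
        row for row in rows
        if row["confidence(language)"] is not None
        and row["confidence(language)"] >= threshold
    ]

confident = keep_confident(rows)
```

Here the Dutch sentence that was assigned to German with a low confidence is dropped, while the two confident predictions are kept.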

The algorithm/library used in this operator is lingua. For more details please see: https://github.com/pemistahl/lingua.

Input

  • exa (Data table)

    The example set containing the text column whose language should be detected.

Output

  • exa (Data table)

    The result table with two additional columns: prediction(language) and confidence(language).

  • ori (Data table)

    The original data set passed through.

Parameters

  • text attribute The text attribute you want to detect languages for.
  • languages The languages that should be included in the scoring.
    • all: Includes all 70+ languages. The full list of supported languages is available at https://github.com/pemistahl/lingua.
    • most_spoken: Includes the most spoken languages in the world. This includes: English, Chinese, Hindi, Spanish, French, Arabic, Bengali and Russian.
    • most_spoken_european: Includes the most spoken European languages. This includes: English, Spanish, French, Russian, German, Italian and Portuguese.
    • arabic_script: Includes all languages written in Arabic script.
    • cyrillic_script: Includes all languages written in Cyrillic script.
    • latin_script: Includes all languages written in Latin script.
    • custom: Lets you select the languages manually using the language_selection parameter.
  • selected languages Allows the user to select the subset of languages to use for detection.
  • fail on error If set to true, the operator fails if it encounters a text it cannot process. This may be the case if the text is missing, empty, consists only of special characters, or if its language has nothing in common with the language selection. If set to false, the operator sets the prediction and confidence to missing in this case.
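
The two behaviours of fail on error can be sketched in Python as follows. This is only an illustration of the described semantics, not the operator's implementation: detect_or_missing and toy_detect are hypothetical names, and the check for "no letters at all" is an assumed stand-in for missing, empty, or special-character-only text.

```python
def detect_or_missing(text, detect, fail_on_error):
    """Return (language, confidence), or (None, None) for unprocessable text.

    `detect` is a hypothetical callable that returns (language, confidence),
    or None if the text matches none of the selected languages.
    """
    result = None
    # Missing, empty, or letter-free text is treated as unprocessable.
    if text is not None and any(ch.isalpha() for ch in text):
        result = detect(text)
    if result is None:
        if fail_on_error:
            raise ValueError(f"Cannot detect a language for {text!r}")
        return (None, None)  # missing prediction and missing confidence
    return result

def toy_detect(text):
    """Stand-in detector: recognizes 'English' only by the word 'the'."""
    return ("English", 0.9) if "the" in text.lower().split() else None

print(detect_or_missing("the cat", toy_detect, fail_on_error=True))
print(detect_or_missing("!!!", toy_detect, fail_on_error=False))
```

With fail_on_error=False the table keeps every row and the unprocessable ones carry missing values, which can be filtered out afterwards.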

Tutorial Processes

Single Text Scoring